Two Stage Max Gain Content Defined Chunking for De-duplication
Abstract
Data de-duplication is a simple concept backed by sophisticated technology. De-duplication systems reduce storage consumption by identifying chunks of data with identical content and storing each distinct chunk only once, along with metadata describing how to reconstruct the original files from the chunks; this takes less space than storing the duplicated data again. Most de-duplication programs are based on fingerprinting. The data is segmented into chunks, which are written to disk as blocks, and a signature, much like a fingerprint, is computed for each segment. An index is maintained over the fingerprints; it can be recreated from the stored data segments. When repeated data is encountered in the data stream, the de-duplication algorithm inserts into the metadata a pointer to the block that is already stored. If the same block appears more than once, multiple pointers refer to a single block on disk. This eliminates redundant blocks from the data stream and improves storage efficiency. Content-defined chunking means that duplicate blocks are identified based on the internal properties of the data stream. Many algorithms are in use to identify repeated content in a data stream. This paper proposes a new two-stage content-based chunking algorithm: the first stage can be any content-based algorithm based on cut points, and the second stage, the newly developed max gain algorithm, identifies new cut points in the repeated data.
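The fingerprint-and-pointer scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the max gain second stage is not specified in the abstract and is not attempted here. The cut-point rule (a toy shift-and-add hash tested against `mask`), the names `cdc_chunks` and `DedupStore`, the size limits, and the use of SHA-256 as the fingerprint are all assumptions chosen for illustration.

```python
import hashlib


def cdc_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16, max_size: int = 256):
    """Split data at content-defined cut points.

    Hypothetical first-stage chunker: a toy shift-and-add hash over the
    recent bytes; a cut is declared when its low bits are all zero.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        # Cut when the hash matches the mask (after min_size) or at max_size.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


class DedupStore:
    """Stores each distinct chunk once, keyed by its fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes (the index)

    def put(self, data: bytes):
        """Store data and return its recipe: a list of fingerprints (pointers)."""
        recipe = []
        for chunk in cdc_chunks(data):
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # a repeated chunk is not stored again
            recipe.append(fp)
        return recipe

    def get(self, recipe):
        """Reconstruct the original data from its recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)
```

Storing the same stream twice with `put` adds no new entries to `chunks`; the second call only returns another list of pointers to blocks that are already present, and `get` rebuilds the original bytes from either list.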
Similar resources
Survey of Research on Chunking Techniques
The explosive growth of data produced by different devices and applications has contributed to the abundance of big data. To process such amounts of data efficiently, strategies such as de-duplication have been employed. Among the three levels of de-duplication, namely file level, block level and chunk level, de-duplication at chunk level, also known as byte level, is the most popular a...
Implementation of a File System with Encryption and De-duplication
With the rapid advance of society, especially the development of computer technology, network technology and information technology, there is an increasing demand for systems that can provide secure data storage in a cost-effective manner. In this paper, we propose a prototype file system called EDFS (Encryption and De-duplication File System), which provides both data security and space effici...
Application for Data De-duplication Algorithm Based on Mobile Devices
With the rapid development of mobile devices, coupled with the rise of personal cloud services, the business of cloud storage and cloud synchronization has grown rapidly in recent years. As a result, it puts pressure on network storage space and network bandwidth, especially in the field of the mobile Internet. A data de-duplication algorithm can reduce the data redundancy in the...
Leap-based Content Defined Chunking - Theory and Implementation
Content Defined Chunking (CDC) is an important component in data deduplication, which affects both the deduplication ratio as well as deduplication performance. The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios since they have to slide byte by byte. The a...
Accelerating Data Deduplication by Exploiting Pipelining and Parallelism with Multicore or Manycore Processors
As the amount of digital data grows explosively, data deduplication has gained increasing attention for its space-efficient functionality, which not only reduces the storage space requirement by eliminating duplicate data but also minimizes the transmission of redundant data in data-intensive storage systems. Most existing state-of-the-art deduplication methods remove redundant data at either ...
Publication date: 2012